github header

Telco Customer Churn Project

1 Telco Customer Churn

1.1 Description

The Telco customer churn data contains information about a telephone company that provided home phone and Internet services to 7043 customers in California at the end of 2017 Quarter 3. It indicates which customers have left, stayed, or signed up for their service.
Studying such data can help companies identify the characteristics of lost customers, identify potential, soon-to-be-lost customers and develop appropriate strategies to retain them.
The dataset is WA_Fn-UseC_-Telco-Customer-Churn.csv.

1.1.1 variables

  • gender: Female or Male
  • SeniorCitizen: customer is a senior citizen or not (Yes, No)
  • Partner: customer has a partner or not (Yes, No)
  • Dependents: customer has dependents or not (Yes, No)
  • tenure: number of months the customer has stayed with the company
  • PhoneService: customer has a phone service or not (Yes, No)
  • MultipleLines: customer has multiple lines or not (Yes, No, No phone service)
  • InternetService: customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity: customer has online security or not (Yes, No, No internet service)
  • OnlineBackup: customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection: customer has device protection or not (Yes, No, No internet service)
  • TechSupport: customer has tech support or not (Yes, No, No internet service)
  • StreamingTV: customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies: customer has streaming movies or not (Yes, No, No internet service)
  • Contract: contract term of the customer (Month-to-month, One year, Two year)
  • PaperlessBilling: customer has paperless billing or not (Yes, No)
  • PaymentMethod: Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)
  • MonthlyCharges: amount charged monthly
  • TotalCharges: total amount charged
  • Churn: customer churned or not (Yes or No)

2 Exploratory Data Analysis (EDA)

2.1 What factors influence customer churn?

2.1.1 bar plots for categorical variables

# gender seniorCitizen partner depedents
gender <- ggplot(customer, aes(x=gender, fill=Churn)) +
                  geom_bar(position="fill") +
                  scale_fill_manual(values=c("pink3", "steelblue")) +
                  labs(title="Gender", x="", y="Percentage") +
                  theme(legend.position="top")
senior <- ggplot(customer, aes(x=SeniorCitizen, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "Not senior", "Yes" = "Senior")) +
                   labs(title="SeniorCitizen", x="", y="Percentage") +
                   theme(legend.position="top")
partner <- ggplot(customer, aes(x=Partner, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "Have no partner", "Yes" = "Have a partner")) +
                   labs(title="Partner", x="", y="Percentage") +
                   theme(legend.position="top")
dependent <- ggplot(customer, aes(x=Dependents, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "Have no dependents", "Yes" = "Have dependents")) +
                   labs(title="Dependent", x="", y="Percentage") +
                   theme(legend.position="top")
basicInfo <- ggarrange(gender, senior, partner, dependent)
annotate_figure(basicInfo, top = text_grob("Customer Basic Infomation", 
               color = "Black", face = "bold", size = 14))

In this group, gender and partner do not influence churn percentage. Whether the customers are senior citizens and whether they are dependent or not are two most influential factors.

# PhoneService MultipleLines
phone_service <- ggplot(customer, aes(x=PhoneService, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "Have no phone service", "Yes" = "have phone services")) +
                   labs(title="Phone Service", x="", y="Percentage") +
                   theme(legend.position="top")
multiple_lines <- ggplot(customer, aes(x=MultipleLines, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "Have no multiple lines", "Yes" = "Have multiple lines")) +
                   labs(title="Multiple lines", x="", y="Percentage") +
                   theme(legend.position="top")

PhoneInfo <- ggarrange(phone_service, multiple_lines)
annotate_figure(PhoneInfo, top = text_grob("Phone Infomation", 
               color = "Black", face = "bold", size = 14))

We can see that these two factors does not influence the churn percentage a lot, but the columns height are different. Therefore, they are not influential factors.

#InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
internet_service <- ggplot(customer, aes(x=InternetService, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Internet services", x="Internet company", y="Percentage") +
                   theme(legend.position="top")
online_security <- ggplot(customer, aes(x=OnlineSecurity, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Online security", x="", y="Percentage") +
                   theme(legend.position="top")
online_backup <- ggplot(customer, aes(x=OnlineBackup, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Online backup", x="", y="Percentage") +
                   theme(legend.position="top")
device_protection <- ggplot(customer, aes(x=DeviceProtection, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Device protection", x="", y="Percentage") +
                   theme(legend.position="top")
tech_support <- ggplot(customer, aes(x=TechSupport, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Tech support", x="", y="Percentage") +
                   theme(legend.position="top")

InternetInfo <- ggarrange(internet_service, online_security, online_backup,                                                device_protection, tech_support)
annotate_figure(InternetInfo, top = text_grob("Internet Information", 
               color = "Black", face = "bold", size = 14))

In the internet part, we can see these categorical variables all have influences on churn. However, online security and tech support are two remarkable factors. People with no online security and techn support will churn most.

#StreamingTV StreamingMovies
streaming_TV <- ggplot(customer, aes(x=StreamingTV, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Streaming tv", x="", y="Percentage") +
                   theme(legend.position="top")
streaming_movies <- ggplot(customer, aes(x=StreamingMovies, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Streaming movies", x="", y="Percentage") +
                   theme(legend.position="top")

StreamingInfo <- ggarrange(streaming_TV, streaming_movies)
annotate_figure(StreamingInfo, top = text_grob("Streaming Information", 
               color = "Black", face = "bold", size = 14))

For streaming information, we neglect customers who does not have internet service. Without middle column, we can see that they do not influence churn a lot.

#Contract PaperlessBilling PaymentMethod
contract <- ggplot(customer, aes(x=Contract, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   labs(title="Contract", x="Method", y="Percentage") +
                   theme(legend.position="top")
paperless_billing <- ggplot(customer, aes(x=PaperlessBilling, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("No" = "No paperless billing", "Yes" = "Have paperless billing")) +
                   labs(title="Paperless billing", x="", y="Percentage") +
                   theme(legend.position="top")
payment_method <- ggplot(customer, aes(x=PaymentMethod, fill=Churn)) +
                   geom_bar(position="fill") +
                   scale_fill_manual(values=c("pink3", "steelblue")) +
                   scale_x_discrete(labels=c("Bank transfer (automatic)" = "bank transfer", "Electronic check" = "E-check", "Credit card (automatic)" = "credit card")) +
                   labs(title="Payment method", x="Method", y="Percentage") +
                   theme(legend.position="top")

PaymentInfo <- ggarrange(contract, paperless_billing, payment_method)
annotate_figure(PaymentInfo, top = text_grob("Payment Information", 
               color = "Black", face = "bold", size = 14))

For the final part, they are all important factors. In payment method, we can see that the churn percentage of customers who choose to e-check is high. Maybe the online payment system is a trouble, and it influences the experience. Next, our team will introduce numeric variables.

2.1.2 kde plots for numeric variables

tenure_kdeplot <- ggplot(data = customer, aes(x = tenure, color = Churn)) + 
            geom_density(aes(fill = Churn), alpha = 0.8) + 
             scale_fill_manual(values=c("pink3", "steelblue")) +
              labs(title="KDEplot for tenure") +
               labs(x="tenure", y="density") +
                theme(legend.position="top")
tenure_kdeplot

MonthlyCharges_kdeplot <- ggplot(data = customer, aes(x = MonthlyCharges, color = Churn)) + 
                   geom_density(aes(fill = Churn), alpha = 0.8) + 
                    scale_fill_manual(values=c("pink3", "steelblue")) +
                     labs(title="KDEplot for MonthlyCharges") +
                      labs(x="MonthlyCharges", y="density") +
                       theme(legend.position="top")
MonthlyCharges_kdeplot

TotalCharges_kdeplot <- ggplot(data = customer, aes(x = TotalCharges, color = Churn)) + 
                   geom_density(aes(fill = Churn), alpha = 0.8) + 
                    scale_fill_manual(values=c("pink3", "steelblue")) +
                     labs(title="KDEplot for TotalCharges") +
                      labs(x="TotalCharges", y="density") +
                       theme(legend.position="top")
TotalCharges_kdeplot

lm1 <- glm(Churn~tenure, family = binomial(link = "logit"), data = customer)
lm2 <- glm(Churn~tenure + MonthlyCharges, family = binomial(link = "logit"), data = customer)
lm3 <- glm(Churn~tenure + MonthlyCharges + TotalCharges, family = binomial(link = "logit"), data = customer)
anovat <- anova(lm1,lm2,lm3, test="LRT")
anovat
## Analysis of Deviance Table
## 
## Model 1: Churn ~ tenure
## Model 2: Churn ~ tenure + MonthlyCharges
## Model 3: Churn ~ tenure + MonthlyCharges + TotalCharges
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1      7030       7176                         
## 2      7029       6382  1      794   <2e-16 ***
## 3      7028       6376  1        6    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
xkabledply(anovat)
Table
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
7030 7176 NA NA NA
7029 6382 1 794.37 0.0000
7028 6376 1 5.67 0.0173

Logistic regression is the appropriate regression analysis to predict a binary outcome (the dependent variable) based on a set of independent variables.

Then we compare AUC of model 2 and model 3

prob <- predict(lm2,customer, type = c("response"))
customer$prob <- prob
library(pROC)
g <- roc(Churn ~ prob, data = customer)
plot(g)

auc(customer$Churn, prob)
## Area under the curve: 0.808
prob1 <- predict(lm3,customer, type = c("response"))
customer$prob <- prob1
library(pROC)
g1 <- roc(Churn ~ prob1, data = customer)
plot(g1)

auc(customer$Churn, prob1)
## Area under the curve: 0.809

AUC (Area Under The Curve) - ROC (Receiver Operating Characteristics) curve is a performance measurement for the classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. Higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the Higher the AUC, the better the model is at distinguishing between customer with churn and no churn.

#summary(lm2)
xkabledply(lm2)
Table
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7909 0.0866 -20.7 0
tenure -0.0550 0.0017 -32.5 0
MonthlyCharges 0.0329 0.0013 25.3 0

As we can see from this summary, tenure has negative coefficient with churn and monthlycharges has positive coefficient with churn. It means when we have a customer with lower tenure and high monthlycharges having more probability to churn.

2.1.3 Simple correlations

Since most of variables are factors, it makes more sense to check their Spearman correlations.

customerNum = customer
# convert categorical variable as numeric for spearman method
for(i in 2:21){
  # tenure, MonthlyCharges, TotalCharges
  if (!(i %in% c(6, 19, 20))){
    customerNum[,i] = as.numeric(customerNum[,i])
  }
}
#str(customerNum)

# corrplot with spearman method for categorical variables
customercor <- cor(subset(customerNum, select=-c(customerID, prob)), method="spearman")
#customercor
loadPkg("corrplot")
#corrplot.mixed(customercor, tl.pos = "lt", number.cex = .5, tl.cex=0.8)
corrplot(customercor, type="lower", addCoef.col="black", number.cex=0.5, tl.cex=0.7,title="Telco Customer Churn Correlation", mar=c(0,0,1,0))

unloadPkg("corrplot")

Larger circle means higher correlation. We can see that churn has negative correlation with contract and tenure, which means that customer who stays longer with the company or has a longer contract terms is less likely to churn. Customer who signed up for online security service and has technical support plan is also less likely to churn. So it makes sense that contract and tech support have positive correlation, which means most customers who signed up for a technical support plan also have longer contract term.

2.1.4 violin and scatter plots for high correlated variables

library(patchwork)

tc1 <- ggplot(customer,aes(x = MultipleLines , y = TotalCharges , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("MultipleLines") + ylab("TotalCharges")

tc2 <- ggplot(customer,aes(x = Contract , y = TotalCharges , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Contract") + ylab("TotalCharges")

#tc3 <- ggplot(customer,aes(x = tenure , y = TotalCharges , fill = Churn)) +
#  geom_violin(alpha = 0.5, aes(linetype=NA)) +
#  xlab("Tenure") + ylab("TotalCharges")

tc3 <- ggplot(customer, aes(x=tenure, y=TotalCharges, color=Churn)) +
  geom_point(size=0.6,alpha=0.4)

tc3/ (tc1+tc2) + plot_annotation(title = 'TotalCharges plot')

The customers who have high total charge are likely to churn, no matter how long their tenure are.

The customers who have no phone service or no multiple lines, a month-to-month contract and a lower level total charge are more likely to churn.

#m1 <- ggplot(customer,aes(x = TotalCharges , y = MonthlyCharges , fill = Churn)) +
#  geom_violin(alpha = 0.5, aes(linetype=NA)) +
#  xlab("TotalCharges") + ylab("MonthlyCharges")

m1 <- ggplot(customer, aes(x=MonthlyCharges, y=TotalCharges, color=Churn)) +
  geom_point(size=0.1,alpha=0.4)

m2 <- ggplot(customer,aes(x = MultipleLines , y = MonthlyCharges , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("MultipleLines") + ylab("MonthlyCharges")

m1/m2 + plot_annotation(title = 'MonthlyCharges plot')

The customers who have high monthly charge are more likely to churn before their total charge increasing.

If the customers have no multiple lines service , they are likely to churn if their monthly charge are greater than 75 or between 30 and 50.

If they have no phone service, we will find an interesting results that the more monthly charge they have, the less customer will churn.

And if the customers have multiple lines service, they are like to churn when they have a monthly charge approximately greater than 70.

t1 <- ggplot(customer,aes(x = Partner , y = tenure , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Partner") + ylab("Tenure")

t2 <- ggplot(customer,aes(x = Contract , y = tenure , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Contract") + ylab("Tenure")

t3 <- ggplot(customer,aes(x = PaymentMethod , y = tenure , fill = Churn)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("PaymentMethod") + ylab("Tenure")

(t1+t2)/t3 + 
  plot_annotation(title = 'Tenure violinplot')

The customers who have no partner and short tenure are more likely to churn.

The customers who have a month-to-month contract and short tenure are more likely to churn. Also, when they have long tenure and a 2 year contract, their possibility of churn are obviously increased.

The payment method is also an factor. The customers who use electronic check or mailed check are more likely to churn if their tenure are short.

contract_jitter <- customerNum$Contract*50 +runif(length(customerNum$Contract), -10, 10)
OnlineSecurity_jitter <- customerNum$OnlineSecurity*50 +runif(length(customerNum$OnlineSecurity), -10, 10)
c1 <- ggplot(customer, aes(x=contract_jitter, y=OnlineSecurity_jitter, color=Churn)) +
  geom_point(size=0.01,alpha=0.8)

TechSupport_jitter <- customerNum$TechSupport*50 +runif(length(customerNum$TechSupport), -10, 10)
c2 <- ggplot(customer, aes(x=contract_jitter, y=TechSupport_jitter, color=Churn)) +
  geom_point(size=0.01,alpha=0.8)

OnlineBackupt_jitter <- customerNum$OnlineBackup*50 +runif(length(customerNum$OnlineBackup), -10, 10)
c3 <- ggplot(customer, aes(x=contract_jitter, y=OnlineBackupt_jitter, color=Churn)) +
  geom_point(size=0.01,alpha=0.8)

DeviceProtection_jitter <- customerNum$DeviceProtection*50 +runif(length(customerNum$DeviceProtection), -10, 10)
c4 <- ggplot(customer, aes(x=contract_jitter, y=DeviceProtection_jitter, color=Churn)) +
  geom_point(size=0.01,alpha=0.8)

(c1+c2)/(c3+c4)+plot_annotation(title = 'Contract Plot(wiht jittering)')

These 4 variables (Online Security & Tech Support & Online Backup & Device Protection) are probably influence, because we find that when customer do not have anyone of them, they are more likely to churn if they have a month-to-month contract, except Device Protection. The result shows that there are many people have Device Protection and month-to-month contract but still churn, even thought the amount of these customers are less than the one who have no Device Protection.